enh(blog): Add blog post on generative AI peer review policy #734
base: main
Conversation
@all-contributors please add @elliesch for review, blog

I've put up a pull request to add @elliesch! 🎉

@all-contributors please add @elliesch for blog, review

cc @willingc in case you are interested in this blog post!! no pressure!!
Co-authored-by: Jed Brown <[email protected]>
willingc left a comment
Love this! A few grammar suggestions.
* Using LLM output verbatim could violate the original code's license
* You might accidentally commit plagiarism or copyright infringement by using that output verbatim in your code
* Due diligence is nearly impossible since you can't trace what the LLM "learned from" (most LLM's are black boxes)
Suggested change:
- * Due diligence is nearly impossible since you can't trace what the LLM "learned from" (most LLM's are black boxes)
+ * Due diligence is nearly impossible since you can't trace what the LLM "learned from" (most LLMs are black boxes)
I think "verbatim" is being leaned on too much here. An LLM can produce verbatim copies of its corpus, but the standard in copyright law is not limited to verbatim copies. If the process involved copying at any stage, refactoring can only obfuscate. The "substantial similarity" standards in copyright law are used as circumstantial evidence of process. Modifying the result by paraphrasing/refactoring is concealing the evidence (and thus reduces the likelihood of being caught), but does not make the process legal. I think we should be careful to not spread that misconception to readers.
Co-authored-by: Carol Willing <[email protected]>
LLMs are trained on large amounts of open source code; most of that code has licenses that require attribution.
The problem? LLMs sometimes spit out near-exact copies of that training data, but without any attribution or copyright notices.
Suggested change:
- LLMs are trained on large amounts of open source code; most of that code has licenses that require attribution.
- The problem? LLMs sometimes spit out near-exact copies of that training data, but without any attribution or copyright notices.
+ LLMs are trained on large amounts of open source code that is bound by various licenses, many of which require attribution. When an LLM generates code, it may reproduce verbatim output or patterns or structures from that training data—but without attribution or copyright notices.
Trying to include more than just verbatim copying ... fundamentally, the patterns are licensed as well.
Also wondering here - let's say that I produce some code totally on my own that happens to match a pattern of some code with a license that requires attribution. What happens there? (If my code is legitimately developed on my own and the pattern just happens to be a great one that others use too, and maybe I've even seen it before, but I'm not intentionally copying.)
As far as copyright law is concerned, that's exactly the scenario where the substantial similarity standard would be applied. The more substantial the copying and the more closely in time that you would have observed the original, the more likely your work would be found to have substantial similarity and to be infringing. Protecting against that ambiguity is why clean-room design exists.
* In some cases, simplifying language barriers for participants in open source around the world
* Speeding up everyday workflows

Some contributors also believe these products open source more accessible. And for some, maybe they do. However, LLMs also present
Suggested change:
- Some contributors also believe these products open source more accessible. And for some, maybe they do. However, LLMs also present
+ Some contributors also believe these products make open source more accessible. And for some, maybe they do. However, LLMs also present
Please **don’t offload vetting of generative AI content to volunteer reviewers**. Arrive with human-reviewed code that you understand, have tested, and can maintain.

### Watch out for licensing issues.
This section may benefit from another round of editing.
I think it's useful to think of the types of content that may or may not be copyrighted. E.g., refactoring your test suite is unlikely to get you in trouble, but implementing a new algorithm is.
But, in the latter case you probably wouldn't use an LLM wholesale, because as you say above you'd need to understand the algorithm to vet it, and then it'd probably be easier to construct it yourself.
If you are a maintainer or a contributor, some of the above can apply to your development and contribution process, too.
Similar to how peer review systems are being taxed, rapid, AI-assisted pull requests and issues can also overwhelm maintainers too. To combat this:

* Open an issue first before submitting a pull request to ensure it's welcome and needed
Suggested change:
- * Open an issue first before submitting a pull request to ensure it's welcome and needed
+ * Open an issue first before submitting a pull request to ensure it's welcome and needed.
* Keep your pull requests small with clear scopes.
* If you use LLMs, test and edit all of the output before you submit a pull request or issue.
* Flag AI-assisted sections of any contribution so maintainers know where to look closely.
* Be responsive to feedback from maintainers, especially when submitting code that is AI-generated.
I quite like @choldgraf's framing of this, which resonates with my own: we don't care as much how the code came about, but what we want is to have a conversation with a human. Hence the focus on you understanding your PR, being able to respond to PR feedback (without needing to bring an LLM to interpret for you, etc.).
I've now come across the situation where, when I ask people something, they feed it into an LLM and paste the answer back to me 😅 Drives me nuts.
This blog post outlines pyOpenSci's new peer review policy regarding the use of generative AI tools in scientific software, emphasizing transparency, ethical considerations, and the importance of human oversight in the review process.
It is co-developed by the pyOpenSci community and relates to a discussion here:
pyOpenSci/software-peer-review#331